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Abstract. We consider multivariate skew-t distributions for modeling composition data of 
high energy cosmic rays. The model has been validated with simulated data for different 
primary nuclei and hadronic models focusing on the depth of maximum X m8/X and number of 
muons observables. Further, we consider mixtures of multivariate skew-t distributions for 
cosmic ray mass composition determination and event-by-event classification. With respect to 
other approaches in the field, it is based on analytical calculations and allows to incorporate 
different sets of constraints provided by the present hadronic models. We present some 
applications to simulated data sets generated with different nuclear abundances assumptions. 
As it does not fully rely on the hadronic model predictions, the method is particularly suited 
to the current experimental scenario in which evidences of discrepancies of the measured 
data with respect to the models have been reported for some shower observables, such as the 
number of muons at ground level. 



1 Corresponding author. 



Contents 



1 



Introduction 



1 



2 



Modeling the distribution of cosmic ray shower observables 

2.1 Model validation 



2 

4 



3 



A mixture model for mass composition analysis 

3.1 EM parameter estimation 

3.2 EM constrained 



5 

6 

8 



4 



Method application to data 

4.1 Choice of the number of components 

4.2 Fitting performances 

4.3 Classification performances 



9 

12 
12 
13 



5 



Summary 



14 



1 Introduction 

The nuclear composition of cosmic ray particles is a crucial observable to explore the origin of 
such radiation at energies above 10 18 eV. Different theoretical models have been advanced in 
this direction predicting a given flux from each nuclear component. The measurement of the 
relative abundances as a function of the primary energy at least for groups of nuclear com- 
ponents therefore provide a deep insight in the field and contribute to significantly constrain 
the present theories. 

At energies above the knee the measurement of the mass must necessarily be based on 
indirect techniques, making use of shower parameters sensitive to the primary mass. Among 
these, the depth X max at which the longitudinal development has its maximum and the num- 
ber of muons reaching ground level are the most powerful observables. A recent review 
of the mass composition methods and experimental results over the entire energy spectrum 
range is presented in [1]. 

At present, different statistical methods are employed depending on whether the composition 
information has to be reconstructed on an event-by-event basis. For instance this results to 
be extremely helpful when studying possible correlations of the shower arrival direction with 
given astrophysical objects, in proton-Air cross section analysis or in the search of gamma 
ray or neutrino events in the hadronic background. In all these situations pattern recognition 
methods, e.g. neural networks as in [2-4], linear discriminant analysis [5], supervised clus- 
tering algorithms as in [4], are typically adopted. To determine the nuclear composition on 
average, binned likelihood methods, as in [6], or unfolding analyses, as in [7], are preferred. 

The presence of stochastic shower-to-shower fluctuations and the experimental resolu- 
tion currently achieved in the measurement of the shower parameters impose a limit on the 
number of mass groups to be determined in both kind of analyses, typically less than five. 

All these methods directly compare the observed data with mixtures of distributions 
generated with detailed Monte Carlo simulations of the shower development in atmosphere 
for each nucleus. The only parameters to be optimized are the component weights. 
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Shower simulations strongly rely on theoretical models of the hadronic interactions ex- 
trapolated at energies well beyond those reached in modern accelerators. Among them we 
cite QgsjetOI [8], QgsjetII [9], Sibyll [10] and Epos [11]. Surprisingly, most of the shower 
observables measured so far at such extreme energies are well described or at least brack- 
eted by the hadronic model predictions, i.e. see [1]. Furthermore, the discrepancies among 
different models will be significantly reduced with incoming model releases accounting for 
recent LHC measurements, for instance see [12]. Recently, however, a significant discrepancy 
between measured data and Monte Carlo simulations has been reported in the muon num- 
ber estimator both for vertical [13] and for very inclined showers [14]. The origin of such 
discrepancy is currently under investigation. 

This scenario motivates the development of a statistical model of the shower observables 
and an unsupervised approach, incorporating the valid constraints provided by the hadronic 
models, for mass composition fitting and classification purposes. These represent the main 
goals of the present study, which is organized as follows. 

In Section 2 we explored the possibility of modeling all shower observables, focusing 
our attention on the depth of maximum X max and the number of muons iV„, by using a 
multivariate skew-t distribution. This and similar classes of skew distributions are receiving 
a growing attention in model-based clustering over the last few years, i.e. see the works by Lee 
and Mc Lachlan [15], Vrbik and McNicholas [16], Franczak et al [17] and Azzalini and Genton 
[18], due to their flexibility to model non-gaussian asymmetric data and the possibility of 
developing elegant and relatively computationally straightforward mixture solutions in the 
EM framework. The validation of the model with Monte Carlo shower simulations is reported 
in paragraph 2.1. 

In Section 3.1 we describe the clustering algorithm adopted for composition determination. 
It is mainly based on the work by McNicholas et al [16] for what concerns the analytical 
solutions to fit mixtures of multivariate skew-t distributions using the EM algorithm. With 
respect to the latter work we introduced in Section 3.2 a procedure to account for different 
kind of constraints in the optimization directly derived from shower simulations. Finally, 
in Section 4 we applied the method to sample data sets simulated with different nuclear 
abundance assumptions. The achieved fitting and classification performances are presented 
and discussed. 

2 Modeling the distribution of cosmic ray shower observables 

The fluctuations observed in the shower observables for a given primary nucleus and energy 
are related to the stocastic fluctuations in the position of the first interaction point in the 
top layers of the atmosphere and in the secondary interactions occuring along the shower 
development. The latter can be considered as Gaussian distributed, given the large number of 
particles involved, while an exponential distribution is assumed for the interaction probability. 
The resulting distribution is asymmetric with a given degree of skewness depending on the 
considered parameters. It turns out from these considerations that a suitable distribution 
describing the fluctuation of a single shower observable x is a convolution of the exponential 
and gaussian distributions (exponentially modified gaussian (EMG)) [19]: 




(2.1) 
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Figure 1: X max -A^ t distribution for different primary nuclei at 10 eV predicted by the 
EposI.99 model. 



where A is the attenuation parameter of the exponential function, /i and a the mean and width 
parameters of the gaussian function, and erf the standard error function. The above model 
is based on physical considerations and provides a good representation of both simulated and 
real data. However, to our knowledge, little efforts have been carried out to describe the joint 
distribution of several shower observables and no analytical solutions exists to fit a mixture of 
EMG distributions. For this reason in this work we propose a different model, based on the 
multivariate skew-t (MST) distribution, to describe the fluctuations of m shower observables 
x. It is as flexible as model 2.1 in reproducing the skewness observed in the shower observables 
and, as we will show in section 3.1, it has the great advantage that a closed form exists for 
mixture fitting with the EM algorithm. 

Based on [16, 25, 26], we say that a random vector X follows a g-variate skew-t distri- 
bution /(x; 4>) with location vector /j,, covariance matrix S, skewness vector d and v degrees 
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of freedom if it can be represented by 

X = S\U\ + XJ 

where U |u,u; ~ N p (li + S\u\, E/u;), U\w ~ N(0,l/w) and TV ~ T(u/2,v/2). The joint 
distribution of X, £7 and is given by: 



R-l-wS 



/(x, it, to) = Cw ' e 



(2.2) 



where T(-) is the Gamma function and: 



S =-[(x - ll - &\u\) t YT x {k -ll- S\u\) + u 2 + v] 
R=(v+p + l)/2 



C={v/2y' 2 (2^) E ^ i |S| 1 / 2 r(i//2) 



Integrating out u and w from the joint distribution we get the expression of the multivariate 
skew-t distribution: 



h-R 



f(^4i) = 2CT{R)^ 7 ^I 1 [R, 



(2.3) 



where: 



a = -(l + S'^S) 
b = - l -(*-n)'V- 1 8 

c= ^(x- Ai ) / S- 1 (x-/x) + ^/2 

D = c — b 2 /a 



h(E, a) = J (1 + x 2 Y E dx = f A__| _ a ^{1/2, E, 3/2, -a 2 } 
in which 2-^1 denotes the generalized Gauss hypergeometric function, given by: 

u ; r(z-x)r(z-y) 

Finally, denotes the overall parameter of the distribution, that is = (ll, S, S, v). 
2.1 Model validation 

The model presented in the previous section has been validated with shower simulations 
generated with Conex v2r2.3 tool [20, 21] for three different hadronic models (QgsjetII 
[9], Sibyll 2.1 [10], Epos 1.99 [11]) and a large set of primary nuclei (p, He, C, N, O, Si, 
Ca, Fe). A fixed primary energy was assumed in the simulation, ranging from 10 17 to 10 20 
eV in step of log 10 E=0.5, and a zenith range from 0° to 90°. We restrict the analysis to the 
shower depth of maximum X max and the number of muons Nn at ground level above 1 GeV 
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Figure 2: Constraint region on the MST parameters from hadronic models as a function of 
the logarithm of the primary mass A. 



which represent the most sensitive observables to the primary composition. 
In Fig. 1 we report the distribution of the two parameters obtained from Epos simulations 
for different primary nuclei at an energy of 10 19 eV. The agreement with the MST model 
is remarkable. Similar plots are obtained for the other two hadronic models (not shown 
here). In Fig. 2 the fitted values of the MST parameters are reported for the three hadronic 
interaction models as a function of the logarithm of the nuclear mass. Although ll and 5] do 
not directly represent the mean and covariance of the distribution, it is noteworthy to retrieve 
the expected behavior of the means, decreasing with the mass for X max and increasing for 
Nu, and of the variances, decreasing with the nuclear mass. Interestingly, the skewness is 
found to decrease with the nuclear mass for the depth of maximum variable while assumes 
nearly constant values for the muon component. 

3 A mixture model for mass composition analysis 

In this section we describe the mixture model we designed for mass composition analysis. 
It has been developed in C++ with links to the Root tool [27] for numerical integration 
routines and to the R statistical tool [29] via the RInside/Rcpp interface [28] for multivariate 
mathematical functions. 

Assume we are provided with a set of N cosmic ray data observations X = (xi, . . . , xjv) 

coming from different nuclear species, where Xj = (X maXi , iV w ). In this case, the distribution 
of the random vector X can be modeled via a mixture /(x; 0) of K multivariate skew-t 
distributions: 

K 

/(x;0) = ^vr fc /(x;0 fc ) (3.1) 
k=l 

1 The N,j, observable has been scaled by an empirical parametrization g(8) — po + pid + P2O 2 to get rid of 
the zenith angle dependence. 
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where tt^ are the mixture weights (satisfying tt^ > and J^fcLi = 1> composition abun- 
dances), /(x;0 fc ) denotes the density of the k component (2.3) where cj> k = (fj, k , Sk, Vk) 
denotes the parameters of the fcth component and = {(fii, ■ ■ ■ , 4> k , tti, . . . , 7Tfc} denotes the 
overall parameter of the mixture. 

According to the maximum likelihood approach, the parameters in (3.1) can be esti- 
mated by maximizing the log-likelihood function: 



JV N 

log £(0; X) = log 11 9) = J2 /(*; *) ■ (3-2) 

4 = 1 1=1 



The maximization has to be carried out numerically as no analytical solution exists. However 
it has been shown in [16] that a closed form solution can be obtained using the EM algorithm. 
In next section we briefly summarize the parameter estimation procedure. 



C(9') , 



Figure 3: Sketch showing the situation of violating constraints. At iteration (t) constraints 
are assumed to be satisfied, while at the next iteration (t+1) they are violated. Given the 
ascent and continuous property of the likelihood in the EM it will exist a point (*) in between 
the two steps in which the violation occurs. The idea consists of retrieving such point and 
limit the EM update in the dashed gray region before. 



3.1 EM parameter estimation 

In the EM framework [24], the N observed data x=(xi,. . . ,xjv) are considered incomplete and 
U = (tti, • • • j un) and W = (u>i, . . . , wn) are unobservable latent variables. We introduce 
the missing data Z = (zi, . . . , zjv) such that Zj = (zn, . . . , zki) (i=l,...N) with 2^=1 if 

Xj comes from the fc-th component and Zik=0 otherwise. The set (X,Z,U, W) denotes the 

complete-data. Thus the complete-data log-likelihood C c (6) can be expressed as: 



log£ c (0) = log£l(7Tl, . . .,TT K ) + log £ 2 (/•*!, £l,<$l, • • • , fJ. K ,^K,S K ) + log£ 3 (^l, ...,U K ) 



- 6 - 



K N 



where 

log £l = ^ ^ Z ki log TTk 

k=l i=l 
if iV 

log £2 =X]5Z "y^sl^fe 1 ! ~ l°g( 27r ) + ^i( x i - A*fc - S k u i ) T 'E^ 1 (bxi - n k - S k Ui)] 

k=l i=l 
K N 

log£ 3 = ^2^2z ki {--[(p- l)\ogWi + Wiuj) + 
k=l i=l 

+ |K- Iog(yfe/2)] + lo g r(^/2) + (zv fc /2 - 1) log™*} 

The EM algorithm is first initialized by choosing a starting approximation 0^ for the model 
parameters and then proceeds by iterating two consecutive steps until convergence. At a 
given iteration t the E step computes the expected value Q(6\6^) of the complete log- 
likelihood, which is maximized in the M step with respect to to obtain a new parameter 
estimate 0^ t+l \ The following expectations are required to compute Q(6\0^): 

ir (t) f(^\e (t) ) 

E(Z ki W k \ Xi ) = P(Z ki = l\xi)E(W k \xi) = P(Z ki = l|xi)ei fW 
E{Z ki \U k \W k \ Xi ) = P{Z ki = l\xi)E(\U k \W k \xi) = P(Z ki = l\*i)e 2M 
E(Z kl UiW k \ Xi ) = P(Z ki = l\xi)E(U%W k \xi) = P(Z ki = l\y^)e m 
E(Z ki \ogW k \yLi) = P(Z ki = llxijEClogWfclxi) = P(Z ki = l\^)e 4M 

An analytical calculation of the above expectation terms e\ k i, e^u, e3,ki, ^iki nas been 
recently provided in [16]. Maximizing Q(0\6^), i.e. setting the derivatives with respect to 
6 equal to zero, yields the update estimates of the model parameters at the t + 1 iteration: 



v^Af (t) 

(t+i) _2^i=i r ki 



N 

V JV Jt) { (t) _g.Jt) s 



ki l,ik 

N 



Z^i=l T ki 8=1 

- e W 5 W fx- - u ( V - fx- - u (t) )5 (t) e (t) + 

+ e 3,fcj°fc °k J 

V-iV (t) (t) f _ (t)x 

2_/i=l 'fci c 3,ki 



(108(^/2) - ^(,r i} /2) + i)£r£) +e^(4!L - 4' 



at iv 

(*) 

8=1 8=1 
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The v update does not exist in closed form and have to be either estimated by solving 
numerically the above equation or specified in advance. The above update rules guarantee a 
monotonic increase of the log-likelihood. 

3.2 EM constrained 

In the case of the mass composition identification of cosmic rays a set of parameter constraints 
can be derived from physical considerations as well as from predictions obtained with the 
Heitler model [22, 23] or with the current hadronic models. We report in Fig. 2 the mean 
[/,, covariance matrix E, skewness 5 and degree of freedom v parameters as a function of 
the logarithm of the nuclear mass ln(^4) at an energy of 10 19 eV. The numerical values are 
obtained by fitting a single MST distribution to our simulated samples. The three black lines 
correspond to three different hadronic models, the purple area to the region constrained by 
the three models. As one can see, hadronic models provide stringent lower and upper bounds 
on the mixture parameters. If we therefore specify in advance the primary masses A k of the 
mixture groups to be determined from data and order them according to their primary mass 
{A k < Ak+i) we can define the following sets of constraints holding for all shower observables: 

• Mean constraints: a^ kj < fi k j — ^(k+i)j < b^ k] , Vk, j=l,. ■ ■ ,q where fj,^ denotes the 
jth component of the mean vector fi k , for suitable constants a Mfe . , b^ .. 

As a consequence of the higher interaction cross section of heavy nuclei with the air 
molecules with respect to light ones, an ordering constraint of the kind /ij ^ i s 
established among each nuclear component means at the same primary energy, e.g. 

4 msx < l*x~mvx or > anc ^ so on ^ or °t ner observables. A more stringent 

constraint, requiring the difference of each component means within a given bound 
[a^ ,b^.], is provided by the hadronic models. 

• Variance constraints: a kjj > o"(fc+l)#> a akj . < a kjj < b Ckj . Mk, j=l,...,q where a kjj 
denotes the jth diagonal component of the covariance matrix for suitable constants 

Due to the larger number of nucleons involved in the interactions, heavy nuclei exhibit 
smaller shower-to-shower fluctuations in all shower observables compared to light nuclei 
at the same primary energy, e.g. o"x max ^ ^CmL an< ^ 17 < ^tv" 1 - Moreover, hadronic 
models provides also a constraint bound [a ak .,b akj ] for each mixture component. 

• Skewness and degrees of freedom constraints: a$ kj < 5 k j < bs kj , a Vk < v k < b Uk V/c, 
j=l,. ■ ■ ,q where 6j~j denotes the jth component of the skewness vector 6j~, with as kj , 
b Skj , a Ukj , b Ukj constants. 

Useful bound regions on the skewness parameters 5 and v are provided by the hadronic 
models. 

When analyzing simulated or real data a larger tolerance region with respect to that 
defined by the models should be considered for several reasons. First we need to account for 
possible discrepancies of the data with respect to model predictions. Moreover, if we consider 
the non-null experimental resolution achieved in the measurement of the shower observables 
and that the nuclear species are not considered alone in analysis but instead grouped in 
few general groups (i.e. light-^4, intermediate- A and heavy- A) the variance to be considered 
for each group k is effectively larger. In Fig. 2 the tolerance region is represented by the 
gray area defined assuming a ±50% bound with respect to the model constraint region. To 
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incorporate the above constraints in the parameter space of the statistical model we consider 
the case of a given parameter 9 violating its constraints at a given iteration t + 1. At the 
previous iteration the constraints are assumed to be satisfied. This situation is sketched in 
Fig. 3. Due to the the continuous and monotonic properties of the likelihood function in the 
EM we can trace back the exact point 9* in which the constraint is violated after the (t + 1) 
update by introducing the following notation: 



9* = (1 - a)0® + a^ m) 



(3.3) 



with a real values in the range [0,1]. When a=0 the parameter estimated at iteration t is 
obtained, while the update at iteration t + 1 is obtained when a=l. The constraint violation 
occurs at an intermediate value a=a* . We note that when satisfies the constraints 9* 

satisfies the constraints Va. To effectively constrain the EM update it is sufficient to choose 
an arbitrary a < a* , e.g. a*/s (s > 1). In practice we are slowing down the EM algorithm 
with s controlling the slow-down rate. For the different types of constraints discussed above 
we have: 
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Variance constraint: 
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Skewness constraint: 
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Degrees of freedom constraint: 
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4 Method application to data 

In this section we report the fit results (parameter estimation accuracy and classification 
performances) obtained over random sets (A^ samp ; es =100) of simulated data, generated with 
the Epos 1.99 model for a reference energy of 10 19 eV. Two kind of samples have been 
produced with the following specifications: 

• Set I: 1000 two-dimensional observations of 3 nuclei (p, N, Fe) with relative abundances 
set to 0.5, 0.2, 0.3 respectively. 

• Set II: 1000 two-dimensional observations of 5 nuclei (p, He, N, Si, Fe) with rela- 
tive abundances set to 0.4, 0.1, 0.1, 0.1, 0.3 respectively. Each observation has been 
convoluted with a gaussian distribution of width o^et to take into account the effect 
of a non-zero experimental resolution. We assumed realistic resolutions for the two 
variables, namely o"d e t(X max )= 20 g/cm 2 and a comparable resolution for the number 
of muons a det {N^)= 3%. 
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Figure 4: BIC index computed for data sample I (left panel) and II (right panel) as a 
function of the number of components K for the three fitting assumptions: it fit (dotted 
lines), ir+fj, (dashed lines), 7r+^+X+<5 fit (solid lines). 



The fit of the data are repeated in three different conditions. In a first case we assume that 
the hadronic models are giving a reasonable representation of the data, hence we fixed all 
mixture parameters and determine the component fractions. This case, denoted as it fit in the 
following, corresponds to what has been typically done so far when analyzing real cosmic ray 
data. In a second case, denoted as (it + fi) fit, we assume that the models are not "trustable" 
in the mean parameters while continue to provide a reasonable description of the data for 
what concerns the shape of the shower observable distributions. This situation presumably 
corresponds to what has been recently observed in real data for the muon number observable. 
We therefore fitted the means and component weights and leave the other parameters fixed. 
Finally in the third case, denoted as (it + \i + E + 5) fit, we released all parameters in the fit 
but the number of degrees of freedom which is specified in advance. 

The parameters are initialized with multiple random starting values (~100) generated 
in the constrained space. The best starting values in terms of the maximum log-likelihood 
achieved are then used for the final fit estimate. 

Each optimization is considered to have converged if at a given iteration stage it satisfies 
the Aitken criterion [30], namely when < e (e=10~ 6 ), where C^ t+1 ^ is the log- 

likelihood at iteration t + 1 and C^ 1 ^ is the asymptotic log-likelihood at iteration t + 1 
[31]: 

r(t+l) r(t) 



with aW = (£(' +1 ) - £W)/(£W - £(*-!)) Aitken acceleration at iteration t. 
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Figure 5: Fit results for a data sample generated from set I (left panel) and II (right panel) 
for the full fit (vr+/i+£+<5) assumption. Top panels: Contour plots of the 3-component MST 
mixture model fitted to data (empty black dots); Bottom panels: Contour plots of each fitted 
MST component (red=p, green=iV, blue=i ? e). Colored dots indicate the true observations 
from each mass group (see text). 



- 11 - 



4.1 Choice of the number of components 

Several criteria are available in literature to estimate the number of clusters in the data. 
However, none of these generally leads to definite results, i.e. on the same data set different 
criteria can predict a different number of optimal clusters, and a user decision is anyway 
required according to the particular requirements of the analysis. In our case the goal is 
to provide a best fit of the data with the minimum number of components and the best 
performances, in terms of classification and mass abundance reconstruction, at least for the 
light-heavy groups or eventually for three category groups (light-intermediate- heavy) . 
It is however instructive to test at least one of the available criteria on the market to our 
problem, for example the Bayesian Information Criterion (bic). The bic index is defined 
as the maximized value of the likelihood log£ plus a penalty term accounting for possible 
overfitting of the data as the number of components (model parameters) increases: 

bic = —2 log C. -\- k log N (4.2) 

where k the number of degrees of freedom of the model, e.g. the number of free parameters, 
and N the number of observations. According to this definition, the model with the larger Bic 
value is the one to be preferred. We report in Fig. 4 the BIC index computed for data sample 
I (left panel) and II (right panel) for the three fitting assumptions, respectively reported with 
dotted, dashed and solid lines. As one can see, the criterion effectively manages to predict 
the real number of components present in the first data sample. Further, it clearly evidences 
that both data samples cannot be efficiently described with a 2-component model and that 
3 components are sufficient to provide an accurate description of data generated by a larger 
number of components, as in the second data set. We will therefore report in next section 
the fitting results relative to a 3 component fit. 

Table 1: Fit results for data set I 
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(n+fi+'E+S) Fit 

Component 


TV 




0.51 


0.19 


0.30 


0.50 


0.20 


0.30 


0.50 


0.22 


0.28 


Ml 




745.86 


716.13 


689.51 


751.11 


715.39 


689.71 


751.38 


709.23 


688.96 


ill 




1.23 


1.47 


1.65 


1.23 


1.45 


1.64 


1.22 


1.46 


1.65 


(Til 




310.56 


175.88 


90.19 


310.56 


175.88 


90.19 


518.71 


237.59 


84.18 


022 




0.027 


0.0066 


0.0047 


0.027 


0.0066 


0.0047 


0.032 


0.0064 


0.0044 


012 




-2.19 


0.03 


0.08 


-2.19 


0.03 


0.08 


-3.47 


-0.52 


0.17 


Si 




69.84 


33.38 


23.09 


69.84 


33.38 


23.09 


66.15 


41.78 


24.42 


5 2 




0.01 


0.03 


0.03 


0.01 


0.03 


0.03 


0.02 


0.03 


0.03 


V 




6.23 


27.07 


20.46 


6.23 


27.07 


20.46 


6.23 


27.07 


20.46 


c 






-4964.39 






-4487.93 






-4947.89 




Etot 






0.92 






0.91 






0.91 




£ 




0.93 


0.82 


0.97 


0.91 


0.81 


0.97 


0.90 


0.87 


0.94 



4.2 Fitting performances 

In Fig. 5 we report the results obtained by fitting a 3-components MST mixture model 
(p+N+Fe) to one particular sample generated from data set I (left panels) and data set II 
(right panels). For simplicity we present the results relative to the full fit situation, as the fit 
results obtained with the other two assumptions are visually similar, and report the values 
of the fitted parameters with the three fitting assumptions in Tables 1 and 2. The colored 
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Table 2: Fit results for data set II 







tv Fit 






(tt+aO Fit 




(n+fi+'E+S) Fit 


Par 




Component 






Component 






Component 




TV 


0.47 


0.20 


0.34 


0.49 


0.18 


0.33 


0.53 


0.18 


0.29 


Ml 


745.86 


716.13 


689.51 


743.79 


710.56 


690.34 


741.39 


703.41 


691.91 




1.23 


1.47 


1.65 


1.25 


1.48 


1.64 


1.28 


1.50 


1.65 


(Til 


310.56 


175.88 


90.19 


310.56 


175.88 


90.19 


518.71 


209.70 


113.42 


0~22 


0.027 


0.0066 


0.0047 


0.027 


0.0066 


0.0047 


0.033 


0.0066 


0.0046 


0~12 


-2.19 


0.03 


0.08 


-2.19 


0.03 


0.08 


-3.43 


-0.45 


0.17 


8l 


69.84 


33.38 


23.09 


69.84 


33.38 


23.09 


64.27 


37.27 


21.08 


82 


0.01 


0.03 


0.03 


0.01 


0.03 


0.03 


-0.001 


0.03 


0.03 


I' 


6.23 


27.07 


20.46 


6.23 


27.07 


20.46 


6.23 


27.07 


20.46 


c 




-4886.3 






-4881.5 






-4870.27 




£tot 




0.86 






0.88 






0.87 




e 


0.87 


0.66 


0.97 


0.92 


0.65 


0.97 


0.94 


0.64 


0.93 



lines represent the contour plots of the fitted model. We consider 3 general groups to be 
fitted: light (1<A<4), intermediate (12<A<28) and heavy (28<A<56). Lower panels show 
separately the three groups present in the data with the fitted components superimposed 
(light: red lines, intermediate: green, heavy: blue). As can be seen in both cases the fit 
nicely converges towards the expected solution defined by fitting a single MST model to each 
component alone. 

To evaluate the performances achieved for composition reconstruction we report in Figures 
6(a) and 6(c) the values of the fitted fractions averaged over the N samp i es generated samples 
for three mass groups and fitting conditions (tt fit: filled dots, w+fi fit: empty dots, ir+fi+'S+S 
fit: empty squares). The histograms shown with solid lines indicate the true composition 
fractions. The error bars refer to the obtained fraction RMS. For both data sets the method is 
able to resolve with good accuracy the three mass groups, with a slightly larger deviation for 
data set II with respect to the expected values, due to the helium and silicon contamination. 
Such small bias is however within the uncertainties of the method, found of the order of 0.05 
on the reconstructed fractions. 

4.3 Classification performances 

The fitted model can be used also for classification scopes. Each observation i is assigned 
to the mixture component k according to the maximum a posteriori probability t^. We can 
therefore compute for both data sets the efficiency e achieved for classification. We consider 
here, as above, the same 3 groups to be identified: light (1<A<4), intermediate (12<A<28) 
and heavy (28<A<56). In Tables 1 and 2 we reported the results for each mixture component 
separately in the case of the data sample analyzed in previous paragraph. 
The average classification performances achieved in both data sets are reported in Figures 
6(b) and 6(d) for the three groups (light: red, intermediate: green, heavy: blue). In data 
set I an overall classification efficiency e ~0.9 is observed for all fitting cases with a slightly 
larger error rate for the intermediate component (~20%). In data set II we have a consid- 
erably larger contamination coming from other species with respect to that used to fit the 
data and hence the event-by-event classification performances are significantly deteriorated 
for the intermediate component, with typical error rates of ~40%. However, focusing on the 
two extreme mass groups to be determined, we still manage to achieve a good classification 
with efficiencies around 90%. 
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(c) Composition Fit - Data Set II 



(d) Classification Fit - Data Set II 



Figure 6: Performance of composition abundance reconstruction and classification for data 
set I (upper panels) and II (bottom panels) for the three fitting conditions (w fit: filled dots, 
(ir+fJ-) fit: empty dots, (7r+/i+S+(5) fit: empty squares) and 3-component groups (light: red, 
intermediate: green, heavy: blue). 

The obtained misclassification errors are comparable with typical results obtained with com- 
pletely supervised methods, such as neural networks, in [4]. 



5 Summary 

In the present paper we proposed a new model, based on the multivariate skew-t function, 
to describe the joint distribution of cosmic ray shower observables at high energies. We have 
tested our model with simulated data for a large set of nuclei, focusing the attention on the 
depth of shower maximum and number of muons variables, which are the most discriminating 
in composition studies. We also developed a constrained clustering algorithm to reconstruct 
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the mass composition information of the data, on an event-by-event basis as well as on 
average (relative abundances). That is a very complicated task given the limited sensitivity 
of the shower observables to the primary mass and the strong group superposition due to 
shower-to-shower fluctuations. The designed algorithm allows to include different types of 
constraints coming from the hadronic model predictions. 

We tested the algorithm over samples of data generated with different relative abundances. 
In the ideal case of perfect measurement precision and negligible contaminations from other 
species with respect to that used to fit the data we achieved good classification performances, 
with error rates around 10%, comparable to that found with supervised methods (i.e. neural 
networks) in [4] . Component abundances and means can be reconstructed with good accuracy 
too. 

The classification performances drop off considerably in a more complicated situation in 
which a significant contamination from other nuclear species is present together with additive 
data fluctuations due to the experimental resolution. In this situation the discrimination of 
the data into three general groups (light-intermediate-heavy) is still feasible, with typical 
accuracy in the relative abundances ~0.05 depending on the degree of group contaminations. 
The algorithm, as it is, can be applied to more than two shower observables (i.e. signal 
rise time or asymmetries, muon production depth, . . . ) in a very straightforward way. The 
application to real cosmic ray data is easily done to, as we demonstrated in the case of data 
sample II, eventually explicitly including the effect of the real experimental resolution in the 
Monte Carlo templates and simplifying the model by ignoring the correlations relative to 
variables measured with different detectors. 
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